How to Dedupe Rows in SQL WITHOUT Using DISTINCT


Duplicate rows can be a nightmare for data analysts.


In this guide, I will show you an elegant way to fix them.


The DISTINCT keyword is often used to remove duplicates.


Suppose you have a table named animals with the following data...


id | species | color
   | dog | brown   <- dupe
   | cat | black
   | dog | white
   | cat | white
   | dog | brown   <- dupe


To dedupe with DISTINCT, you would do this...


SELECT DISTINCT species, color 
FROM animals 


Output:

species | color
dog | brown
dog | white
cat | black
cat | white


But what if your data looks like this...

And you want the most recently updated value for each?
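As a rough sketch of why DISTINCT stops helping in this situation, the script below builds a small animals table with an updated_at column. The rows, timestamps, and column types are made-up sample values chosen for illustration, not the data from this guide.

```sql
-- Hypothetical sample data: made-up rows, not the guide's original table.
DROP TABLE IF EXISTS animals;

CREATE TABLE animals (
    id         INTEGER PRIMARY KEY,
    species    VARCHAR(20),
    color      VARCHAR(20),
    updated_at TIMESTAMP
);

INSERT INTO animals (id, species, color, updated_at) VALUES
    (1, 'dog', 'brown', '2022-01-01 12:00:00'),
    (2, 'cat', 'black', '2022-01-02 12:00:00'),
    (3, 'dog', 'white', '2022-01-05 12:00:00'),
    (4, 'cat', 'white', '2022-01-05 12:00:00'),
    (5, 'dog', 'brown', '2022-01-02 12:00:00');

-- DISTINCT compares every selected column, so the two dog/brown rows
-- stop counting as duplicates once updated_at is in the SELECT list.
SELECT DISTINCT species, color, updated_at
FROM animals;
-- All five rows come back, because no two rows match on all three columns.
```

Dropping updated_at from the SELECT list would collapse the duplicates again, but then there is no way to tell which version of each row is the latest.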


Enter Window Functions! 


The window function we will use is called: ROW_NUMBER()


Using ROW_NUMBER() allows us to be more precise with our deduplication.


To use ROW_NUMBER(), start with this...


SELECT *
, ROW_NUMBER() OVER (PARTITION BY species, color
ORDER BY updated_at DESC) AS row_num
FROM animals


Keep the rows where row_num = 1


Then throw it in a CTE

WITH cte AS (
SELECT *
, ROW_NUMBER() OVER (PARTITION BY species, color
ORDER BY updated_at DESC) AS row_num
FROM animals
)
SELECT *
FROM cte
WHERE row_num = 1
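One detail worth knowing when practicing: if two rows in the same partition share the same updated_at, ROW_NUMBER() may put either one first. The sketch below adds id as a tiebreaker so the result is repeatable. The sample rows are made up for illustration, and breaking ties on the highest id is an assumption, not something this guide prescribes.

```sql
-- Hypothetical sample data with a tie: ids 1 and 2 share the same timestamp.
DROP TABLE IF EXISTS animals;

CREATE TABLE animals (
    id         INTEGER PRIMARY KEY,
    species    VARCHAR(20),
    color      VARCHAR(20),
    updated_at TIMESTAMP
);

INSERT INTO animals (id, species, color, updated_at) VALUES
    (1, 'dog', 'brown', '2022-01-02 12:00:00'),
    (2, 'dog', 'brown', '2022-01-02 12:00:00'),
    (3, 'cat', 'black', '2022-01-02 12:00:00');

WITH cte AS (
    SELECT *
         -- id DESC is the assumed tiebreaker: on equal timestamps,
         -- the row with the highest id gets row_num = 1.
         , ROW_NUMBER() OVER (PARTITION BY species, color
                              ORDER BY updated_at DESC, id DESC) AS row_num
    FROM animals
)
SELECT id, species, color, updated_at
FROM cte
WHERE row_num = 1;
-- Keeps id 2 for dog/brown and id 3 for cat/black.
```

Any column that is unique within the partition would work as the tiebreaker. The point is only that the ORDER BY inside OVER() should leave no ties.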


Deduplicated Output:

id | species | updated_at
   | dog | 2022-01-02 12:00:00
   | cat | 2022-01-02 12:00:00
   | dog | 2022-01-05 12:00:00
   | cat | 2022-01-05 12:00:00


Practice using ROW_NUMBER() and you will have it mastered in no time.


Duplicates can lead to inaccuracies in your data.


Now you know a cool way to fix them!